We want to investigate the avocado dataset and, in particular, to model the AveragePrice of the avocados. Use the tools we’ve worked with this week to prepare your dataset and find appropriate predictors. Once you’ve built your model, use the validation techniques discussed on Wednesday to evaluate it. Feel free to focus on building either an explanatory or a predictive model, or both if you are feeling energetic!
As part of the MVP, we want you not just to run the code but also to have a go at interpreting the results and to write your thinking in comments in your script.
Hints and tips
region may lead to many dummy variables. Think carefully about whether to include this variable or not (there is no one ‘right’ answer to this!)
Date will not be needed in your models, but can you extract any useful features out of Date before you discard it?
leaps or glmulti to help with this.
Here is what we found looking for information on the ‘avocado’ data. I am accepting this info as reliable.
“The table represents weekly retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. Starting in 2013, the table below reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLU’s) in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table.”
Relevant info for understanding ‘obscure’ variable names:
AveragePrice - the average price of a single avocado
Region - the city or region of the observation, i.e. where avocados were sold
Total Volume - total number of avocados sold
4046 - total number of small avocados sold (PLU 4046)
4225 - total number of medium avocados sold (PLU 4225)
4770 - total number of large avocados sold (PLU 4770)
The average price recorded here is a per-avocado price and apparently does not depend on bag size, so we can drop the bag variables. Although region may have an impact on price, we have decided to drop ‘region’ during manual model development and to keep it only for testing and automated model development.
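The automated route is not shown in this notebook. As a minimal sketch only, base R's step() can stand in for leaps or glmulti (a swapped-in technique, not the one named in the hints). The data frame `toy` and its columns `signal`, `noise` and `group` are simulated placeholders, not the avocado data.

```r
# Hypothetical illustration on simulated data; all names here are made up.
set.seed(42)
n <- 300
toy <- data.frame(
  signal = rnorm(n),
  noise  = rnorm(n),
  group  = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
toy$price <- 1 + 0.5 * toy$signal + 0.3 * (toy$group == "b") + rnorm(n, sd = 0.2)

# Start from the intercept-only model and let AIC decide which terms to add.
null_model <- lm(price ~ 1, data = toy)
full_scope <- price ~ signal + noise + group

selected <- step(null_model, scope = full_scope, direction = "forward", trace = 0)

# Terms kept by the forward search
selected_terms <- attr(terms(selected), "term.labels")
print(selected_terms)
```

A factor with many levels, such as region, enters or leaves a search like this as one block of dummy variables, which is why it is worth deciding deliberately whether to offer it to the search at all.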
The x1 variable records the week in which sales were recorded, in a 52-weeks-per-year format. Although our brief is not concerned with time series and forecasting, we can investigate whether seasonality has an impact on average price. Avocados are very sensitive to variations in temperature, so weather patterns may affect production and potentially prices. We have decided to keep only the data for 2015-2017, dropping the partial 2018 data. This could help especially if seasons play some role in average price.
So, we’ll focus on average price, type and total volume. We’ll use x1, date and year to engineer variables which will enable us to explore seasonality.
One line conclusion: Weather, especially around October, can have an impact on supply which in turn will influence avocado prices.
Afterthoughts: Thinking carefully about the data and asking the right questions will help with variable engineering and as a result modelling accuracy and outcomes. The ability to run multiple models in a short time helps with this ‘go between’ process and hopefully increases both data value and understanding which may lead to informed quantitative decision making.
library(tidyverse)
library(janitor)
library(ggfortify)
library(GGally)
library(lubridate)
library(modelr)
library(skimr)
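# Read the raw data, standardise column names and keep the columns we need
avocado_df_exp <- read_csv("data/avocado.csv") %>%
  clean_names() %>%
  select(x1:x4770, type:year) %>%
  rename(week = "x1",
         small = "x4046",
         medium = "x4225",
         large = "x4770") %>%
  # keep 2015-2017 only; the partial 2018 data is dropped
  filter(date <= as.Date("2017-12-31"))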
## Warning: Missing column names filled in: 'X1' [1]
##
## -- Column specification --------------------------------------------------------
## cols(
## X1 = col_double(),
## Date = col_date(format = ""),
## AveragePrice = col_double(),
## `Total Volume` = col_double(),
## `4046` = col_double(),
## `4225` = col_double(),
## `4770` = col_double(),
## `Total Bags` = col_double(),
## `Small Bags` = col_double(),
## `Large Bags` = col_double(),
## `XLarge Bags` = col_double(),
## type = col_character(),
## year = col_double(),
## region = col_character()
## )
avocado_tidy <- avocado_df_exp %>%
  # month is kept as character so that lm() treats it as categorical
  mutate(month = as.character(month(date))) %>%
  mutate(season = case_when(
    month %in% c("12", "1", "2")  ~ "winter",
    month %in% c("3", "4", "5")   ~ "spring",
    month %in% c("6", "7", "8")   ~ "summer",
    month %in% c("9", "10", "11") ~ "autumn"
  )) %>%
  mutate(type = as.factor(type),
         season = as.factor(season),
         year = as.factor(year)) %>%
  # mutate(week = as.factor(week))  # not applied: week stays numeric
  select(-date)
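Because month is stored as character, lm() sorts its levels as text ("1", "10", "11", "12", "2", ...), which is why the coefficient tables further down list month10 before month2. The sketch below is not part of the original analysis. It shows how an explicit factor would keep the months in calendar order, and `month_chr` and `month_fct` are illustrative names.

```r
# Illustrative only: month numbers stored as character, as in avocado_tidy
month_chr <- as.character(c(1, 2, 10, 11, 12, 3))

# Default factor conversion sorts the levels as text
default_levels <- levels(factor(month_chr))
print(default_levels)

# Supplying the levels explicitly keeps calendar order
month_fct <- factor(month_chr, levels = as.character(1:12))
print(levels(month_fct))
```

January is the baseline level either way, so the fitted model would be the same; only the order in which the coefficients are printed changes.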
We expect total volume to be strongly correlated with the avocado size variables, so we test this and, if confirmed, drop the size variables.
avocado_tidy %>%
  select(total_volume:large) %>%
  ggpairs()
It is clear we can use total volume as the ‘size’ variable in our analysis.
avocado_tidy <- avocado_tidy %>%
  select(-c(small, medium, large))
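The pairs plot can be backed up with numbers. The following is a sketch on simulated volumes; the real check would run cor() on the total_volume:large columns before they are dropped. The data frame `sizes` and all its values are made up for illustration.

```r
# Simulated stand-in for the size columns; not the avocado data
set.seed(1)
small  <- rlnorm(200, meanlog = 10, sdlog = 1)
medium <- small * runif(200, 0.8, 1.2)
large  <- small * runif(200, 0.03, 0.05)
sizes  <- data.frame(total_volume = small + medium + large, small, medium, large)

# Pearson correlation of every column with total_volume
size_cor <- cor(sizes)[, "total_volume"]
print(round(size_cor, 2))
```

Correlations close to 1 would support keeping total_volume alone, since the size columns would then carry almost the same information.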
Let’s look at summary statistics. We’ll employ both summary() and skim() functions to compare their different output formats.
summary(avocado_tidy)
## week average_price total_volume type
## Min. : 0.00 Min. :0.44 Min. : 85 conventional:8478
## 1st Qu.:13.00 1st Qu.:1.10 1st Qu.: 10460 organic :8475
## Median :26.00 Median :1.37 Median : 104849
## Mean :25.66 Mean :1.41 Mean : 834110
## 3rd Qu.:39.00 3rd Qu.:1.67 3rd Qu.: 423186
## Max. :52.00 Max. :3.25 Max. :61034457
## year month season
## 2015:5615 Length:16953 autumn:4212
## 2016:5616 Class :character spring:4320
## 2017:5722 Mode :character summer:4210
## winter:4211
##
##
avocado_tidy %>%
  skim()
| Name | Piped data |
| Number of rows | 16953 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| factor | 3 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| month | 0 | 1 | 1 | 2 | 0 | 12 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| type | 0 | 1 | FALSE | 2 | con: 8478, org: 8475 |
| year | 0 | 1 | FALSE | 3 | 201: 5722, 201: 5616, 201: 5615 |
| season | 0 | 1 | FALSE | 4 | spr: 4320, aut: 4212, win: 4211, sum: 4210 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| week | 0 | 1 | 25.66 | 15.11 | 0.00 | 13.00 | 26.00 | 39.00 | 52.00 | ▇▇▇▇▇ |
| average_price | 0 | 1 | 1.41 | 0.41 | 0.44 | 1.10 | 1.37 | 1.67 | 3.25 | ▂▇▅▁▁ |
| total_volume | 0 | 1 | 834109.85 | 3381120.41 | 84.56 | 10459.56 | 104849.39 | 423186.06 | 61034457.10 | ▇▁▁▁▁ |
‘total volume’ is extremely skewed, so this may affect our models. We need to look into this.
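One common response to this kind of right skew, and the one used in the models later in this notebook, is a log transform. The sketch below shows the effect on simulated data. `volume` is a made-up lognormal sample and `skewness()` is a small helper defined here, not a base R function.

```r
# Simulated right-skewed sample; not the avocado data
set.seed(7)
volume <- rlnorm(5000, meanlog = 11.5, sdlog = 2)

# Helper defined for illustration: sample skewness as the mean cubed z-score
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(volume)
skew_log <- skewness(log(volume))
print(c(raw = skew_raw, logged = skew_log))
```

A skewness near zero after logging suggests the transformed variable is roughly symmetric, which tends to make a linear fit less dominated by a handful of very large values.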
total_vol_by_type <- avocado_tidy %>%
  group_by(type) %>%
  summarise(avg_total_vol = mean(total_volume)) %>%
  mutate(pct = prop.table(avg_total_vol) * 100)
total_vol_by_type
More than 97% of the avocados sold in the data are conventional. It makes sense to focus on this type for average price modelling (for comparison, we have provided manual modelling in a separate notebook).
avocado_tidy_conv <- avocado_tidy %>%
  filter(type == "conventional") %>%
  select(-type)
avocado_tidy_org <- avocado_tidy %>%
  filter(type == "organic") %>%
  select(-type)
both_types <- ggplot(avocado_tidy) +
  aes(x = total_volume, y = average_price) +
  geom_point(size = 1L, colour = "#0c4c8a") +
  geom_smooth(span = 0.75) +
  scale_x_continuous(trans = "log") +
  scale_y_continuous(trans = "log") +
  labs(title = "Average price decreases when Total Volume increases",
       subtitle = "both types") +
  theme_minimal()
conventional <- ggplot(avocado_tidy_conv) +
  aes(x = total_volume, y = average_price) +
  geom_point(size = 1L, colour = "#0c4c8a") +
  geom_smooth(span = 0.75) +
  scale_x_continuous(trans = "log") +
  scale_y_continuous(trans = "log") +
  labs(title = "Average price decreases when Total Volume increases",
       subtitle = "conventional") +
  theme_minimal()
organic <- ggplot(avocado_tidy_org) +
  aes(x = total_volume, y = average_price) +
  geom_point(size = 1L, colour = "#0c4c8a") +
  geom_smooth(span = 0.75) +
  scale_x_continuous(trans = "log") +
  scale_y_continuous(trans = "log") +
  labs(title = "Average price decreases when Total Volume increases",
       subtitle = "organic") +
  theme_minimal()
both_types
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
conventional
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
organic
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
ggplot(avocado_df_exp) +
  aes(x = date, y = average_price, colour = type) +
  geom_line(size = 1L) +
  scale_color_hue() +
  labs(title = "Average Price has a certain degree of seasonality") +
  theme_minimal() +
  facet_wrap(vars(type))
ggplot(avocado_df_exp) +
  aes(x = type, y = average_price, fill = type) +
  geom_boxplot() +
  scale_fill_hue() +
  labs(title = "As expected average price is higher for organic type") +
  theme_minimal()
ggplot(avocado_df_exp) +
  aes(x = date, weight = total_volume) +
  geom_bar(fill = "#0c4c8a") +
  labs(title = "Total Volume also shows a seasonal pattern") +
  theme_minimal()
avocado_tidy_conv %>%
  ggpairs(aes(colour = season, alpha = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
mod_total_volume <- lm(average_price ~ log(total_volume), data = avocado_tidy_conv)
mod_total_volume
##
## Call:
## lm(formula = average_price ~ log(total_volume), data = avocado_tidy_conv)
##
## Coefficients:
## (Intercept) log(total_volume)
## 1.75676 -0.04544
summary(mod_total_volume)
##
## Call:
## lm(formula = average_price ~ log(total_volume), data = avocado_tidy_conv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65584 -0.18583 -0.02832 0.15142 1.06102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.756763 0.027874 63.03 <2e-16 ***
## log(total_volume) -0.045444 0.002113 -21.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2599 on 8476 degrees of freedom
## Multiple R-squared: 0.05175, Adjusted R-squared: 0.05164
## F-statistic: 462.6 on 1 and 8476 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(mod_total_volume)
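Since the predictor is logged, the slope is easiest to read as the change in price for a proportional change in volume. Below is a short worked calculation using the coefficient reported in the summary above. The two scenarios (doubling, plus 10%) are chosen here for illustration, and `price_change()` is a helper defined only for this sketch.

```r
# Slope of log(total_volume), copied from the summary output above
beta_log_volume <- -0.045444

# Expected change in average_price when total_volume is multiplied by k:
# beta * log(k), because log(k * v) - log(v) = log(k)
price_change <- function(k, beta = beta_log_volume) beta * log(k)

change_doubling <- price_change(2)     # volume doubles
change_10pct    <- price_change(1.10)  # volume rises by 10%
print(round(c(doubling = change_doubling, plus_10pct = change_10pct), 4))
```

On this reading, doubling the volume is associated with an average price roughly 0.03 lower. With an R-squared of about 0.05, though, volume on its own explains very little of the variation in price.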
mod_month <- lm(average_price ~ month, data = avocado_tidy_conv)
mod_month
##
## Call:
## lm(formula = average_price ~ month, data = avocado_tidy_conv)
##
## Coefficients:
## (Intercept) month10 month11 month12 month2 month3
## 1.03694 0.31239 0.16911 0.04045 -0.03765 0.08790
## month4 month5 month6 month7 month8 month9
## 0.10541 0.05263 0.11225 0.17554 0.19845 0.25779
summary(mod_month)
##
## Call:
## lm(formula = average_price ~ month, data = avocado_tidy_conv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.71474 -0.17605 -0.02249 0.16751 0.87066
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.036944 0.009007 115.125 < 2e-16 ***
## month10 0.312394 0.012738 24.525 < 2e-16 ***
## month11 0.169110 0.012981 13.028 < 2e-16 ***
## month12 0.040449 0.012981 3.116 0.00184 **
## month2 -0.037654 0.013258 -2.840 0.00452 **
## month3 0.087899 0.012981 6.772 1.36e-11 ***
## month4 0.105406 0.012981 8.120 5.31e-16 ***
## month5 0.052632 0.012738 4.132 3.63e-05 ***
## month6 0.112253 0.013258 8.467 < 2e-16 ***
## month7 0.175542 0.012738 13.781 < 2e-16 ***
## month8 0.198454 0.012981 15.288 < 2e-16 ***
## month9 0.257793 0.013258 19.444 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2477 on 8466 degrees of freedom
## Multiple R-squared: 0.1397, Adjusted R-squared: 0.1386
## F-statistic: 125 on 11 and 8466 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(mod_month)
mod_week <- lm(average_price ~ week, data = avocado_tidy_conv)
mod_week
##
## Call:
## lm(formula = average_price ~ week, data = avocado_tidy_conv)
##
## Coefficients:
## (Intercept) week
## 1.263930 -0.004035
summary(mod_week)
##
## Call:
## lm(formula = average_price ~ week, data = avocado_tidy_conv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77393 -0.18287 -0.02453 0.16081 1.00450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2639300 0.0055623 227.23 <2e-16 ***
## week -0.0040355 0.0001867 -21.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2598 on 8476 degrees of freedom
## Multiple R-squared: 0.05221, Adjusted R-squared: 0.0521
## F-statistic: 467 on 1 and 8476 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(mod_week)
mod_season <- lm(average_price ~ season, data = avocado_tidy_conv)
mod_season
##
## Call:
## lm(formula = average_price ~ season, data = avocado_tidy_conv)
##
## Coefficients:
## (Intercept) seasonspring seasonsummer seasonwinter
## 1.28478 -0.16659 -0.08413 -0.24594
summary(mod_season)
##
## Call:
## lm(formula = average_price ~ season, data = avocado_tidy_conv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.70478 -0.17819 -0.02065 0.16181 0.93522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.284777 0.005463 235.18 <2e-16 ***
## seasonspring -0.166587 0.007677 -21.70 <2e-16 ***
## seasonsummer -0.084126 0.007726 -10.89 <2e-16 ***
## seasonwinter -0.245935 0.007726 -31.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2507 on 8474 degrees of freedom
## Multiple R-squared: 0.1176, Adjusted R-squared: 0.1173
## F-statistic: 376.3 on 3 and 8474 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(mod_season)
mod_year <- lm(average_price ~ year, data = avocado_tidy_conv)
mod_year
##
## Call:
## lm(formula = average_price ~ year, data = avocado_tidy_conv)
##
## Coefficients:
## (Intercept) year2016 year2017
## 1.07796 0.02763 0.21693
summary(mod_year)
##
## Call:
## lm(formula = average_price ~ year, data = avocado_tidy_conv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.83489 -0.15559 -0.00559 0.15441 1.09441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.077963 0.004694 229.663 < 2e-16 ***
## year2016 0.027632 0.006638 4.163 3.18e-05 ***
## year2017 0.216925 0.006606 32.835 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2487 on 8475 degrees of freedom
## Multiple R-squared: 0.1314, Adjusted R-squared: 0.1312
## F-statistic: 640.8 on 2 and 8475 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(mod_year)
remaining_resid <- avocado_tidy_conv %>%
  add_residuals(mod_month) %>%
  select(-c(average_price, month))
remaining_resid %>%
  ggpairs(aes(colour = season, alpha = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
mod_month_total_volume <- lm(average_price ~ month + log(total_volume), data = avocado_tidy_conv)
mod_month_total_volume
##
## Call:
## lm(formula = average_price ~ month + log(total_volume), data = avocado_tidy_conv)
##
## Coefficients:
## (Intercept) month10 month11 month12
## 1.58425 0.30265 0.15852 0.03508
## month2 month3 month4 month5
## -0.03444 0.08558 0.10551 0.05772
## month6 month7 month8 month9
## 0.11515 0.17562 0.19540 0.25206
## log(total_volume)
## -0.04154
summary(mod_month_total_volume)
##
## Call:
## lm(formula = average_price ~ month + log(total_volume), data = avocado_tidy_conv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66874 -0.17550 -0.02301 0.15789 0.87952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.58425 0.02741 57.807 < 2e-16 ***
## month10 0.30265 0.01243 24.357 < 2e-16 ***
## month11 0.15852 0.01266 12.518 < 2e-16 ***
## month12 0.03508 0.01266 2.772 0.00558 **
## month2 -0.03444 0.01293 -2.665 0.00772 **
## month3 0.08558 0.01265 6.763 1.44e-11 ***
## month4 0.10551 0.01265 8.338 < 2e-16 ***
## month5 0.05772 0.01242 4.648 3.41e-06 ***
## month6 0.11515 0.01293 8.909 < 2e-16 ***
## month7 0.17563 0.01242 14.144 < 2e-16 ***
## month8 0.19540 0.01265 15.442 < 2e-16 ***
## month9 0.25206 0.01293 19.499 < 2e-16 ***
## log(total_volume) -0.04154 0.00197 -21.082 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2414 on 8465 degrees of freedom
## Multiple R-squared: 0.1826, Adjusted R-squared: 0.1815
## F-statistic: 157.6 on 12 and 8465 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(mod_month_total_volume)
mod_month_year <- lm(average_price ~ month + year, data = avocado_tidy_conv)
mod_month_year
##
## Call:
## lm(formula = average_price ~ month + year, data = avocado_tidy_conv)
##
## Coefficients:
## (Intercept) month10 month11 month12 month2 month3
## 0.94914 0.31239 0.18127 0.03583 -0.03180 0.10006
## month4 month5 month6 month7 month8 month9
## 0.10078 0.06821 0.11811 0.17554 0.21061 0.26365
## year2016 year2017
## 0.02771 0.21814
summary(mod_month_year)
##
## Call:
## lm(formula = average_price ~ month + year, data = avocado_tidy_conv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.74855 -0.15728 -0.00253 0.14710 0.91188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.949141 0.009103 104.264 < 2e-16 ***
## month10 0.312394 0.011720 26.656 < 2e-16 ***
## month11 0.181267 0.011954 15.163 < 2e-16 ***
## month12 0.035826 0.011946 2.999 0.00272 **
## month2 -0.031801 0.012201 -2.606 0.00916 **
## month3 0.100056 0.011954 8.370 < 2e-16 ***
## month4 0.100783 0.011946 8.437 < 2e-16 ***
## month5 0.068214 0.011728 5.817 6.23e-09 ***
## month6 0.118107 0.012201 9.680 < 2e-16 ***
## month7 0.175542 0.011720 14.979 < 2e-16 ***
## month8 0.210612 0.011954 17.618 < 2e-16 ***
## month9 0.263647 0.012201 21.609 < 2e-16 ***
## year2016 0.027709 0.006094 4.547 5.52e-06 ***
## year2017 0.218141 0.006071 35.929 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2279 on 8464 degrees of freedom
## Multiple R-squared: 0.2719, Adjusted R-squared: 0.2708
## F-statistic: 243.2 on 13 and 8464 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(mod_month_year)
mod_month_week <- lm(average_price ~ month + week, data = avocado_tidy_conv)
mod_month_week
##
## Call:
## lm(formula = average_price ~ month + week, data = avocado_tidy_conv)
##
## Coefficients:
## (Intercept) month10 month11 month12 month2 month3
## 0.645495 0.620809 0.513111 0.418515 -0.003386 0.155117
## month4 month5 month6 month7 month8 month9
## 0.206690 0.189894 0.283595 0.381152 0.439650 0.531940
## week
## 0.007908
summary(mod_month_week)
##
## Call:
## lm(formula = average_price ~ month + week, data = avocado_tidy_conv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72396 -0.17540 -0.02266 0.16719 0.87043
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.645495 0.102075 6.324 2.68e-10 ***
## month10 0.620809 0.081114 7.654 2.17e-14 ***
## month11 0.513111 0.090289 5.683 1.37e-08 ***
## month12 0.418515 0.099054 4.225 2.41e-05 ***
## month2 -0.003386 0.015960 -0.212 0.831989
## month3 0.155117 0.021750 7.132 1.07e-12 ***
## month4 0.206690 0.029332 7.047 1.98e-12 ***
## month5 0.189894 0.037857 5.016 5.38e-07 ***
## month6 0.283595 0.046435 6.107 1.06e-09 ***
## month7 0.381152 0.054902 6.942 4.14e-12 ***
## month8 0.439650 0.063978 6.872 6.78e-12 ***
## month9 0.531940 0.072430 7.344 2.26e-13 ***
## week 0.007908 0.002054 3.850 0.000119 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2475 on 8465 degrees of freedom
## Multiple R-squared: 0.1412, Adjusted R-squared: 0.14
## F-statistic: 116 on 12 and 8465 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(mod_month_week)
mod_month_season <- lm(average_price ~ month + season, data = avocado_tidy_conv)
mod_month_season
##
## Call:
## lm(formula = average_price ~ month + season, data = avocado_tidy_conv)
##
## Coefficients:
## (Intercept) month10 month11 month12 month2
## 1.03694 0.31239 0.16911 0.04045 -0.03765
## month3 month4 month5 month6 month7
## 0.08790 0.10541 0.05263 0.11225 0.17554
## month8 month9 seasonspring seasonsummer seasonwinter
## 0.19845 0.25779 NA NA NA
summary(mod_month_season)
##
## Call:
## lm(formula = average_price ~ month + season, data = avocado_tidy_conv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.71474 -0.17605 -0.02249 0.16751 0.87066
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.036944 0.009007 115.125 < 2e-16 ***
## month10 0.312394 0.012738 24.525 < 2e-16 ***
## month11 0.169110 0.012981 13.028 < 2e-16 ***
## month12 0.040449 0.012981 3.116 0.00184 **
## month2 -0.037654 0.013258 -2.840 0.00452 **
## month3 0.087899 0.012981 6.772 1.36e-11 ***
## month4 0.105406 0.012981 8.120 5.31e-16 ***
## month5 0.052632 0.012738 4.132 3.63e-05 ***
## month6 0.112253 0.013258 8.467 < 2e-16 ***
## month7 0.175542 0.012738 13.781 < 2e-16 ***
## month8 0.198454 0.012981 15.288 < 2e-16 ***
## month9 0.257793 0.013258 19.444 < 2e-16 ***
## seasonspring NA NA NA NA
## seasonsummer NA NA NA NA
## seasonwinter NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2477 on 8466 degrees of freedom
## Multiple R-squared: 0.1397, Adjusted R-squared: 0.1386
## F-statistic: 125 on 11 and 8466 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(mod_month_season)
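The NA coefficients and the note "3 not defined because of singularities" arise because season is fully determined by month, so its dummy columns add no new information to the design matrix. This is also why the fit statistics are identical to those of mod_month. The following is a toy reproduction with simulated prices and made-up names, not the avocado data:

```r
set.seed(3)
toy <- data.frame(month = factor(rep(1:12, each = 10)))

# season is a deterministic function of month, as in avocado_tidy
season_of <- c("winter", "winter", "spring", "spring", "spring", "summer",
               "summer", "summer", "autumn", "autumn", "autumn", "winter")
toy$season <- factor(season_of[as.integer(as.character(toy$month))])
toy$price  <- rnorm(nrow(toy), mean = 1.2, sd = 0.25)

fit <- lm(price ~ month + season, data = toy)

# The season terms cannot be estimated once month is in the model
aliased_terms <- names(coef(fit))[is.na(coef(fit))]
print(aliased_terms)

# The design matrix has fewer independent columns than coefficients
print(c(rank = fit$rank, coefficients = length(coef(fit))))
```

In practice this means month and season should be treated as alternatives: season is a coarser version of the same information.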
remaining_resid <- avocado_tidy_conv %>%
  add_residuals(mod_month_year) %>%
  select(-c(average_price, month, year))
remaining_resid %>%
  ggpairs(aes(colour = season, alpha = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
mod_month_year_total_volume <- lm(average_price ~ month + year + log(total_volume), data = avocado_tidy_conv)
mod_month_year_total_volume
##
## Call:
## lm(formula = average_price ~ month + year + log(total_volume),
## data = avocado_tidy_conv)
##
## Coefficients:
## (Intercept) month10 month11 month12
## 1.51701 0.30222 0.17068 0.03033
## month2 month3 month4 month5
## -0.02822 0.09810 0.10099 0.07387
## month6 month7 month8 month9
## 0.12135 0.17563 0.20790 0.25789
## year2016 year2017 log(total_volume)
## 0.03238 0.22297 -0.04336
summary(mod_month_year_total_volume)
##
## Call:
## lm(formula = average_price ~ month + year + log(total_volume),
## data = avocado_tidy_conv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.69624 -0.15698 -0.00183 0.14641 0.91822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.517006 0.025174 60.262 < 2e-16 ***
## month10 0.302221 0.011346 26.636 < 2e-16 ***
## month11 0.170679 0.011574 14.747 < 2e-16 ***
## month12 0.030325 0.011559 2.623 0.00872 **
## month2 -0.028222 0.011805 -2.391 0.01683 *
## month3 0.098104 0.011566 8.482 < 2e-16 ***
## month4 0.100985 0.011557 8.738 < 2e-16 ***
## month5 0.073872 0.011348 6.509 7.97e-11 ***
## month6 0.121354 0.011805 10.280 < 2e-16 ***
## month7 0.175628 0.011338 15.490 < 2e-16 ***
## month8 0.207896 0.011566 17.975 < 2e-16 ***
## month9 0.257891 0.011806 21.844 < 2e-16 ***
## year2016 0.032376 0.005899 5.488 4.17e-08 ***
## year2017 0.222975 0.005877 37.938 < 2e-16 ***
## log(total_volume) -0.043357 0.001801 -24.080 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2204 on 8463 degrees of freedom
## Multiple R-squared: 0.3186, Adjusted R-squared: 0.3175
## F-statistic: 282.7 on 14 and 8463 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(mod_month_year_total_volume)
mod_month_year_season <- lm(average_price ~ month + year + season, data = avocado_tidy_conv)
mod_month_year_season
##
## Call:
## lm(formula = average_price ~ month + year + season, data = avocado_tidy_conv)
##
## Coefficients:
## (Intercept) month10 month11 month12 month2
## 0.94914 0.31239 0.18127 0.03583 -0.03180
## month3 month4 month5 month6 month7
## 0.10006 0.10078 0.06821 0.11811 0.17554
## month8 month9 year2016 year2017 seasonspring
## 0.21061 0.26365 0.02771 0.21814 NA
## seasonsummer seasonwinter
## NA NA
summary(mod_month_year_season)
##
## Call:
## lm(formula = average_price ~ month + year + season, data = avocado_tidy_conv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.74855 -0.15728 -0.00253 0.14710 0.91188
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.949141 0.009103 104.264 < 2e-16 ***
## month10 0.312394 0.011720 26.656 < 2e-16 ***
## month11 0.181267 0.011954 15.163 < 2e-16 ***
## month12 0.035826 0.011946 2.999 0.00272 **
## month2 -0.031801 0.012201 -2.606 0.00916 **
## month3 0.100056 0.011954 8.370 < 2e-16 ***
## month4 0.100783 0.011946 8.437 < 2e-16 ***
## month5 0.068214 0.011728 5.817 6.23e-09 ***
## month6 0.118107 0.012201 9.680 < 2e-16 ***
## month7 0.175542 0.011720 14.979 < 2e-16 ***
## month8 0.210612 0.011954 17.618 < 2e-16 ***
## month9 0.263647 0.012201 21.609 < 2e-16 ***
## year2016 0.027709 0.006094 4.547 5.52e-06 ***
## year2017 0.218141 0.006071 35.929 < 2e-16 ***
## seasonspring NA NA NA NA
## seasonsummer NA NA NA NA
## seasonwinter NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2279 on 8464 degrees of freedom
## Multiple R-squared: 0.2719, Adjusted R-squared: 0.2708
## F-statistic: 243.2 on 13 and 8464 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(mod_month_year_season)
average_price_residual <- avocado_tidy_conv %>%
  add_residuals(mod_month_year_total_volume) %>%
  select(-average_price)
coplot(resid ~ log(total_volume) | month,
       panel = function(x, y, ...) {
         points(x, y)
         abline(lm(y ~ x), col = "blue")
       },
       data = average_price_residual, columns = 6)
average_price_residual %>%
  ggplot(aes(x = total_volume, y = resid, colour = season)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
mod_interaction1 <- lm(average_price ~ month + year + total_volume + month:year, data = avocado_tidy_conv)
summary(mod_interaction1)
##
## Call:
## lm(formula = average_price ~ month + year + total_volume + month:year,
## data = avocado_tidy_conv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72198 -0.12933 0.00136 0.13026 0.84243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.100e+00 1.413e-02 77.865 < 2e-16 ***
## month10 -2.875e-02 1.995e-02 -1.441 0.149742
## month11 -6.802e-02 1.893e-02 -3.593 0.000328 ***
## month12 -8.672e-02 1.995e-02 -4.346 1.40e-05 ***
## month2 -3.606e-02 1.995e-02 -1.807 0.070737 .
## month3 -2.179e-03 1.893e-02 -0.115 0.908348
## month4 2.500e-02 1.995e-02 1.253 0.210341
## month5 -3.090e-04 1.893e-02 -0.016 0.986978
## month6 -4.434e-03 1.995e-02 -0.222 0.824159
## month7 2.237e-02 1.995e-02 1.121 0.262254
## month8 2.500e-02 1.893e-02 1.321 0.186621
## month9 -1.688e-02 1.995e-02 -0.846 0.397537
## year2016 -1.051e-01 1.893e-02 -5.550 2.95e-08 ***
## year2017 -4.635e-02 1.893e-02 -2.449 0.014364 *
## total_volume -5.223e-09 4.852e-10 -10.766 < 2e-16 ***
## month10:year2016 3.939e-01 2.677e-02 14.713 < 2e-16 ***
## month11:year2016 4.574e-01 2.677e-02 17.084 < 2e-16 ***
## month12:year2016 1.781e-01 2.750e-02 6.475 9.99e-11 ***
## month2:year2016 -1.139e-02 2.750e-02 -0.414 0.678800
## month3:year2016 1.775e-02 2.677e-02 0.663 0.507346
## month4:year2016 -4.940e-02 2.750e-02 -1.796 0.072543 .
## month5:year2016 -4.784e-02 2.602e-02 -1.839 0.065989 .
## month6:year2016 9.171e-02 2.750e-02 3.334 0.000859 ***
## month7:year2016 1.769e-01 2.677e-02 6.606 4.19e-11 ***
## month8:year2016 1.685e-01 2.677e-02 6.296 3.21e-10 ***
## month9:year2016 2.184e-01 2.750e-02 7.942 2.25e-15 ***
## month10:year2017 5.555e-01 2.677e-02 20.749 < 2e-16 ***
## month11:year2017 2.823e-01 2.677e-02 10.543 < 2e-16 ***
## month12:year2017 1.751e-01 2.677e-02 6.540 6.52e-11 ***
## month2:year2017 -1.268e-03 2.750e-02 -0.046 0.963228
## month3:year2017 2.489e-01 2.677e-02 9.298 < 2e-16 ***
## month4:year2017 2.382e-01 2.677e-02 8.899 < 2e-16 ***
## month5:year2017 2.367e-01 2.677e-02 8.843 < 2e-16 ***
## month6:year2017 2.490e-01 2.751e-02 9.051 < 2e-16 ***
## month7:year2017 2.514e-01 2.677e-02 9.389 < 2e-16 ***
## month8:year2017 3.683e-01 2.677e-02 13.756 < 2e-16 ***
## month9:year2017 5.908e-01 2.751e-02 21.479 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2074 on 8441 degrees of freedom
## Multiple R-squared: 0.3986, Adjusted R-squared: 0.396
## F-statistic: 155.4 on 36 and 8441 DF, p-value: < 2.2e-16
mod_interaction2 <- lm(average_price ~ month + year + total_volume + month:total_volume, data = avocado_tidy_conv)
summary(mod_interaction2)
##
## Call:
## lm(formula = average_price ~ month + year + total_volume + month:total_volume,
## data = avocado_tidy_conv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.74814 -0.15598 0.00139 0.14698 0.90601
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.600e-01 9.488e-03 101.183 < 2e-16 ***
## month10 3.081e-01 1.235e-02 24.945 < 2e-16 ***
## month11 1.795e-01 1.259e-02 14.254 < 2e-16 ***
## month12 3.435e-02 1.259e-02 2.729 0.00637 **
## month2 -3.052e-02 1.284e-02 -2.378 0.01745 *
## month3 9.806e-02 1.259e-02 7.786 7.71e-15 ***
## month4 9.963e-02 1.259e-02 7.916 2.76e-15 ***
## month5 6.587e-02 1.236e-02 5.332 9.98e-08 ***
## month6 1.158e-01 1.286e-02 9.004 < 2e-16 ***
## month7 1.701e-01 1.235e-02 13.779 < 2e-16 ***
## month8 2.064e-01 1.260e-02 16.383 < 2e-16 ***
## month9 2.604e-01 1.285e-02 20.260 < 2e-16 ***
## year2016 2.848e-02 6.058e-03 4.701 2.63e-06 ***
## year2017 2.190e-01 6.036e-03 36.288 < 2e-16 ***
## total_volume -6.682e-09 1.687e-09 -3.962 7.51e-05 ***
## month10:total_volume 1.257e-09 2.778e-09 0.452 0.65109
## month11:total_volume -6.043e-10 2.838e-09 -0.213 0.83140
## month12:total_volume 5.215e-11 2.619e-09 0.020 0.98411
## month2:total_volume 1.597e-12 2.328e-09 0.001 0.99945
## month3:total_volume 8.612e-10 2.517e-09 0.342 0.73223
## month4:total_volume 6.403e-10 2.442e-09 0.262 0.79319
## month5:total_volume 1.892e-09 2.280e-09 0.830 0.40663
## month6:total_volume 1.647e-09 2.425e-09 0.679 0.49706
## month7:total_volume 3.053e-09 2.421e-09 1.261 0.20721
## month8:total_volume 2.115e-09 2.566e-09 0.824 0.41000
## month9:total_volume 1.086e-09 2.724e-09 0.399 0.69019
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2265 on 8452 degrees of freedom
## Multiple R-squared: 0.2818, Adjusted R-squared: 0.2797
## F-statistic: 132.7 on 25 and 8452 DF, p-value: < 2.2e-16
mod_interaction3 <- lm(average_price ~ month + year + total_volume + year:total_volume, data = avocado_tidy_conv)
summary(mod_interaction3)
##
## Call:
## lm(formula = average_price ~ month + year + total_volume + year:total_volume,
## data = avocado_tidy_conv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.74847 -0.15549 0.00075 0.14675 0.90728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.575e-01 9.164e-03 104.480 < 2e-16 ***
## month10 3.100e-01 1.165e-02 26.623 < 2e-16 ***
## month11 1.791e-01 1.188e-02 15.076 < 2e-16 ***
## month12 3.461e-02 1.187e-02 2.916 0.00355 **
## month2 -3.073e-02 1.212e-02 -2.535 0.01126 *
## month3 9.945e-02 1.188e-02 8.374 < 2e-16 ***
## month4 1.007e-01 1.187e-02 8.482 < 2e-16 ***
## month5 6.915e-02 1.165e-02 5.934 3.07e-09 ***
## month6 1.186e-01 1.212e-02 9.781 < 2e-16 ***
## month7 1.752e-01 1.164e-02 15.047 < 2e-16 ***
## month8 2.097e-01 1.188e-02 17.657 < 2e-16 ***
## month9 2.621e-01 1.212e-02 21.619 < 2e-16 ***
## year2016 2.878e-02 6.414e-03 4.487 7.32e-06 ***
## year2017 2.211e-01 6.387e-03 34.608 < 2e-16 ***
## total_volume -5.069e-09 9.808e-10 -5.168 2.42e-07 ***
## year2016:total_volume -2.268e-10 1.327e-09 -0.171 0.86424
## year2017:total_volume -1.328e-09 1.320e-09 -1.005 0.31469
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2264 on 8461 degrees of freedom
## Multiple R-squared: 0.2816, Adjusted R-squared: 0.2803
## F-statistic: 207.3 on 16 and 8461 DF, p-value: < 2.2e-16
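Adjusted R-squared is one way to compare these models; for nested models, an F-test via anova() is another. Note that the three interaction models above use total_volume untransformed, so they are not nested within mod_month_year_total_volume, which uses log(total_volume). The sketch below runs on simulated data with made-up names, purely to show the mechanics of the comparison.

```r
set.seed(11)
n <- 600
toy <- data.frame(
  month = factor(sample(1:12, n, replace = TRUE)),
  year  = factor(sample(2015:2017, n, replace = TRUE))
)

# Simulated price with a month effect that only appears in 2017 (an interaction)
late_2017 <- toy$year == "2017" & toy$month %in% c("9", "10")
toy$price <- 1 + 0.2 * (toy$year == "2017") + 0.5 * late_2017 + rnorm(n, sd = 0.2)

additive         <- lm(price ~ month + year, data = toy)
with_interaction <- lm(price ~ month + year + month:year, data = toy)

# F-test of the additive model against the model with the interaction
comparison <- anova(additive, with_interaction)
print(comparison)
p_value <- comparison[["Pr(>F)"]][2]
```

A small p-value indicates that the interaction terms, taken together, improve the fit by more than chance would explain, which is a more direct check than reading individual coefficient rows.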
relaimpo::calc.relimp(mod_month_year_total_volume, type = "lmg", rela = TRUE)
## Response variable: average_price
## Total response variance: 0.0711996
## Analysis based on 8478 observations
##
## 14 Regressors:
## Some regressors combined in groups:
## Group month : month10 month11 month12 month2 month3 month4 month5 month6 month7 month8 month9
## Group year : year2016 year2017
##
## Relative importance of 3 (groups of) regressors assessed:
## month year log(total_volume)
##
## Proportion of variance explained by model: 31.86%
## Metrics are normalized to sum to 100% (rela=TRUE).
##
## Relative importance metrics:
##
## lmg
## month 0.4257359
## year 0.4196833
## log(total_volume) 0.1545809
##
## Average coefficients for different model sizes:
##
## 1group 2groups 3groups
## month10 0.31239418 0.30752078 0.30222082
## month11 0.16910969 0.16989142 0.17067942
## month12 0.04044872 0.03545544 0.03032544
## month2 -0.03765432 -0.03312167 -0.02822223
## month3 0.08789886 0.09281706 0.09810356
## month4 0.10540598 0.10314432 0.10098503
## month5 0.05263228 0.06296795 0.07387167
## month6 0.11225309 0.11662733 0.12135449
## month7 0.17554233 0.17558347 0.17562821
## month8 0.19845442 0.20300704 0.20789596
## month9 0.25779321 0.25785426 0.25789072
## year2016 0.02763177 0.03026475 0.03237644
## year2017 0.21692523 0.22012532 0.22297489
## log(total_volume) -0.04544362 -0.04436619 -0.04335656
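The brief asks for the model to be validated, which this notebook does not yet show. A minimal k-fold cross-validation sketch in base R follows. It runs on simulated data with made-up names (`toy`, `volume`, `fold`), and the assumption is that the same loop could be applied to avocado_tidy_conv with the formula of mod_month_year_total_volume.

```r
# Simulated stand-in for the modelling data; not the avocado data
set.seed(2024)
n <- 500
toy <- data.frame(
  month  = factor(sample(1:12, n, replace = TRUE)),
  volume = rlnorm(n, meanlog = 12, sdlog = 1)
)
toy$price <- 1.5 + 0.02 * as.integer(toy$month) - 0.04 * log(toy$volume) +
  rnorm(n, sd = 0.2)

k <- 10
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label per row

# Fit on k - 1 folds, predict the held-out fold, record its RMSE
rmse_by_fold <- sapply(seq_len(k), function(i) {
  train <- toy[fold != i, ]
  test  <- toy[fold == i, ]
  fit   <- lm(price ~ month + log(volume), data = train)
  sqrt(mean((test$price - predict(fit, newdata = test))^2))
})

cv_rmse <- mean(rmse_by_fold)
print(cv_rmse)
```

If the cross-validated RMSE is close to the residual standard error of the model fitted on all the data, that suggests the model is not badly overfitted; a much larger value would point the other way.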